DOI: 10.1145/3471274.3471279
research-article

SALAD: Static Analyzer for Loop Acceleration by Exploiting DLP

Published: 26 August 2021

ABSTRACT

Data-intensive applications are becoming increasingly popular. However, only a few high-volume applications can afford dedicated hardware acceleration (such as a Neural Network Processor, or NPU) or platform-specific software implementations (such as TensorFlow running on a GPU). In this paper, we propose a hardware- and software-transparent framework for accelerating general-purpose data-intensive applications. Our framework is based on the key insight that most data-intensive applications spend the vast majority of their execution time in inner loops with abundant opportunities for Data-Level Parallelism (DLP). In particular, we propose SALAD, a static analyzer for loop acceleration that exploits DLP in hot loops and is built on the LLVM compiler infrastructure. In contrast to traditional DLP exploration techniques, SALAD is transparent to both software and architecture: it requires no changes to the source code or the binary code, and it does not need vectorized instruction set architecture (ISA) extensions. Instead, it works directly on the program binary code and generates a profile of the DLP opportunities in the binary. This profile is fed to the hardware accelerator transparently to speed up execution. Based on our experimental results, we estimate that the DLP information provided by SALAD could yield speedups of 3.6x to 60.2x on a set of benchmarks, depending on their inherent DLP.
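
As a rough illustration of what "inherent DLP" in a hot loop can mean (this sketch is not taken from the paper and does not show SALAD's analysis), the C fragment below contrasts a loop whose iterations are independent with one that carries a dependence from each iteration to the next. The function names `saxpy` and `prefix_sum` are hypothetical and chosen only for this example.

```c
#include <stddef.h>

/* Illustrative only: these functions are assumptions for this example,
   not code from the SALAD paper. */

/* Each iteration reads and writes only element i (assuming x and y do not
   overlap), so the iterations are independent and expose abundant DLP. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Iteration i reads the value written by iteration i - 1. This loop-carried
   dependence limits the DLP available in this straightforward form. */
void prefix_sum(size_t n, float *v) {
    for (size_t i = 1; i < n; ++i) {
        v[i] = v[i] + v[i - 1];
    }
}
```

A loop like the first would plausibly sit at the high end of a speedup range that depends on inherent DLP, while a loop like the second would offer far less to exploit without restructuring.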

  • Published in

    HP3C '21: Proceedings of the 5th International Conference on High Performance Compilation, Computing and Communications
    June 2021
    71 pages
    ISBN:9781450389648
    DOI:10.1145/3471274

    Copyright © 2021 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited